Abstract

Three-dimensional (3D) graphics applications have become very important workloads
running on today's computer systems. A cost-effective graphics solution is to perform
geometry processing of 3D graphics on the host CPU and have specialized hardware
handle the rendering task. In this paper, we analyze microarchitecture and SIMD
instruction set enhancements to a RISC superscalar processor for exploiting parallelism
in geometry processing for 3D computer graphics.

Our results show that 3D geometry processing has inherent parallelism. Adding SIMD
operations improves performance by 8% to 28% on a 4-issue dynamically scheduled
processor that can issue at most 2 floating-point operations per cycle. In comparison, an
8-issue processor, ignoring cycle time effects, can achieve a 20% to 60% performance
improvement over a 4-issue processor. If processor cycle time scales with the number of
ports to the register file, then doubling only the floating-point issue width of a 4-issue
processor with SIMD instructions gives the best performance among the architectural
configurations that we examine (the most aggressive configuration is an 8-issue processor
with SIMD instructions).


Index terms: 3D graphics, geometry pipeline, superscalar processors, SIMD instructions,
paired-single instructions


This work was supported in part by NSF CAREER Award MIP-97-02547, DARPA Grant DABT63-98-1-0001,
NSF Grants CDA-97-2637 and CDA-95-12356, Duke University, and an equipment donation through Intel
Corporation's Technology for Education 2000 Program. The views and conclusions contained herein are
those of the authors and should not be interpreted as necessarily representing the official policies
or endorsements, either expressed or implied, of the U.S. Government.




Report Documentation Page (Standard Form 298, Rev. 8-98)

Report date: 2005
Title: Exploiting Parallelism in Geometry Processing with General Purpose Processors and
Floating-Point SIMD Instructions
Performing organization: Defense Advanced Research Projects Agency, 3701 North Fairfax Drive,
Arlington, VA 22203-1714
Distribution/availability statement: Approved for public release; distribution unlimited
Abstract: see report
Security classification (report, abstract, this page): unclassified
Number of pages: 30


1  Introduction 

The increasing number of multi-media applications produces a commensurate increase in demand
for cost-effective multi-media processing [11]. Traditionally, media processing was implemented in
expensive custom hardware specialized for specific applications (e.g., speech, video, and graphics).
Advances in conventional microprocessor design now permit offloading some functionality to a
general-purpose processor, possibly sacrificing performance in return for reduced cost. The key is
to minimize this performance degradation, potentially by adding architectural support for media
processing.

Many current microprocessors have Single Instruction Multiple Data (SIMD) instructions to
accelerate audio, video and 2D image processing, such as Intel MMX [16], Sun UltraSPARC VIS [10]
and HP PA-RISC [8]. This type of SIMD instruction operates only on integer data. Today, several
processor vendors, such as MIPS Technologies Inc. [15], Cyrix, IDT, AMD [2], Intel [7], and Motorola
[19], are in various stages of incorporating floating-point SIMD instructions to speed up geometry
processing for three-dimensional (3D) graphics.

Typically, 3D graphics processing is a 3-stage pipeline [5]: 1) database traversal, 2) geometry
computation, and 3) rasterization. Display models representing graphics scenes are generally stored
in a database that must be traversed (stage 1) to extract the appropriate information for display,
such as the drawing primitive (e.g., line or triangle), lighting models, etc. The information is
then passed to the geometry subsystem (stage 2), which is responsible for transforming 3D
coordinates to 2D coordinates. Finally, the rasterization stage (stage 3) converts transformed
primitives into pixel values and stores them in the frame buffer for display.

In high-end graphics systems [17] [18], the host CPU is only responsible for database traversal, and
custom hardware is used for geometry processing and rasterization. The cost of building these
high-end systems is generally too high for the mass market. To reduce cost, the host CPU could
execute some, or all, of the graphics pipeline. This paper focuses specifically on host CPU
execution of geometry computation using a single dynamically scheduled superscalar microprocessor.

Geometry computation is floating-point intensive. Vertex coordinates, colors and transformation
matrices are stored in single-precision floating-point format. Previous studies [1] have shown that
90 floating-point arithmetic operations are required to process a single vertex. Current superscalar
processors can issue 2 floating-point operations per cycle. Together, these figures imply that a
500 MHz processor could theoretically process 11 million vertices per second. This value is close to
the computing capability of today's specialized hardware [18]. However, because of instruction
scheduling and resource limitations, a general-purpose processor is unlikely to achieve this
theoretical rate. The goal of this paper is to examine the performance of geometry processing on a
general-purpose processor and evaluate the benefits of recently proposed instruction set
enhancements.
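
The 11 million figure follows from simple arithmetic on the three numbers quoted above. The short C sketch below reproduces it; the function and parameter names are illustrative and not part of the study, and the result is only an upper bound that assumes floating-point issue is the sole limit.

```c
/* Upper bound on vertices per second when floating-point issue bandwidth
 * is the only limit (no stalls, no integer or memory overhead). */
double peak_vertices_per_second(double clock_hz,
                                double fp_ops_per_cycle,
                                double fp_ops_per_vertex)
{
    return clock_hz * fp_ops_per_cycle / fp_ops_per_vertex;
}
```

With a 500 MHz clock, 2 floating-point operations per cycle and 90 operations per vertex, the function returns roughly 11.1 million vertices per second.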

Geometry processing is an inherently parallel task, since each object vertex can be processed
independently. Dynamically scheduled processors can exploit this parallelism by looking ahead in the
instruction stream to identify and execute the operations associated with different vertices. Recall
that vertex computations require only 32-bit floating-point values. Since most modern
microprocessors have 64-bit floating-point registers, geometry calculations using 32-bit operands
utilize only half the floating-point datapath (registers, functional units, and busses). Another way
to exploit this parallelism is to use SIMD instructions, called paired-single instructions, that
perform operations on multiple vertices in one instruction. Paired-single instructions fully utilize
the 64-bit datapath by performing two independent 32-bit operations, each using half the datapath.

As mentioned above, most processor vendors are incorporating paired-single instructions. AMD's
3DNow! Technology [2] is currently available. However, we are unaware of any published quantitative
evaluation of their performance using full graphics applications. Parthasarathy et al. perform a
detailed performance evaluation of the Sun VIS instruction set, but they focus only on image and
video processing applications [23]. In this paper, we simulate Viewperf [20], an industry standard
benchmark suite, on an out-of-order superscalar processor both with and without paired-single
instructions. We modified the geometry computation routines in Mesa [13] (a public domain
implementation of OpenGL [24]) to utilize paired-single instructions. We first analyze the effects
of increasing the resources available in a conventional processor. This is followed by a comparison
to paired-single execution, both with and without clock cycle time effects.

The  contributions  of  this  paper  are  as  follows: 

1. Although geometry processing presents substantial parallelism, we discover that certain aspects
of application implementations can significantly impact the available parallelism that can be
exploited by a superscalar processor. In the best case, an 8-way issue processor can achieve a 60%
performance improvement over a 4-way processor with a 64-entry dispatch queue and 128 registers, but
for certain benchmarks, the performance only increases by 20%. Furthermore, if the CPU cycle time
scales with the number of ports to the register file, the performance improvement is less than 5%
for all the benchmarks.

2. We analyze the effect of adding paired-single instructions on a set of industry standard 3D
graphics benchmarks instead of small kernels. We found that the performance improvement from pairing
up single-precision floating-point operations ranges from 8% to 28% on a 4-way issue processor that
can issue at most 2 floating-point operations per cycle.

3. We quantify the benefits of paired-single instructions over increasing only the floating-point
issue width in superscalar processors. Our results indicate that adding paired-single instructions
to a 4-issue processor performs within 7% of doubling the floating-point issue width. For certain
benchmarks, the former even outperforms the latter. The performance advantage of paired-single
instructions increases when considering the clock cycle time effect.

The remainder of this paper is organized as follows. Section 2 provides background information on
geometry computation and presents benchmark characteristics. We review paired-single instructions
in Section 3. Section 4 presents our simulation infrastructure, and Section 5 presents our
simulation results. In Section 6, we compare the performance achieved by a general-purpose processor
versus a high-end graphics system. Finally, Section 7 concludes the paper.

2  Background 

To  understand  the  architectural  aspects  of  geometry  processing,  we  first  describe  the  six  stages  of 
the  3D  geometry  pipeline.  Then  we  characterize  a  set  of  OpenGL  performance  evaluation  benchmarks 
(Viewperf  [20]). 

2.1  The  Geometry  Pipeline 

3D applications usually use polygonal primitives (e.g., triangles) to represent objects in an
application database. These primitives are represented in their own coordinate space. How those
objects are displayed on the screen is determined by the following factors: the positions and
orientations of objects in the scene, the viewpoint, the surface properties of objects, and the
light sources. Geometry processing transforms primitives from the object coordinates to the screen
coordinates, and calculates the color for each vertex according to the object surface and light
properties. Geometry processing only operates on vertices. The rasterization stage takes those
transformed vertices and fills in the interiors of polygons.

Similar to the overall 3D computation, geometry processing can be divided into a set of pipeline
stages. In a typical geometry pipeline, there are six stages as shown in Figure 1:

Figure 1: 3D geometry pipeline.


View and model transformation: graphics primitives (e.g., line, triangle or polygon) are
transformed to the viewer's frame of reference. Transformations involve a vector-matrix
multiplication of either a 1x4 vector by a 4x4 matrix or a 1x3 vector by a 3x3 matrix.

Lighting:  the  light  position,  color  and  material  properties  are  used  to  calculate  the  object  color. 

Projection  transformation:  this  stage  determines  how  objects  are  projected  to  the  screen.  This 
again  requires  multiplication  of  a  1x4  vector  and  a  4x4  matrix. 

Clipping:  objects  are  clipped  to  the  viewable  area  to  avoid  unnecessary  rendering. 

Division by w: the x, y, z components of each vertex are divided by its w component. Geometry
processing usually works in the homogeneous coordinate system, where all the vertices are
represented with four coordinates (x, y, z, w). With coordinate positions expressed in homogeneous
form, the transformations (i.e., viewing, modeling and projection transformations) can be simplified
and performed as matrix multiplications [6]. The w component is initially set to one. After applying
the projection transformation, it may no longer equal one. We then perform the division to get
(x/w, y/w, z/w), which are the Cartesian coordinates of the homogeneous point.

Mapping  vertex  coordinates  to  screen  coordinates:  vertices  are  mapped  to  the  screen  coordinates. 

Note that lighting (stage 2) is optional. For those applications that only perform wireframe
rendering or implement a global illumination algorithm (i.e., the color of each vertex is
precomputed), the lighting stage is unnecessary. However, the other 5 stages are mandatory.
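
As an illustration of the arithmetic in the transformation, division by w and screen-mapping stages, the following C sketch processes a single vertex. It is a minimal sketch under stated assumptions, not the Mesa code: the type and function names are hypothetical, vertices are treated as row vectors (1x4 times 4x4) as in the stage descriptions above, and the screen mapping assumes normalized coordinates in [-1, 1] and a viewport whose origin is at (0, 0).

```c
typedef struct { float x, y, z, w; } Vec4;

/* Row vector (1x4) times matrix (4x4): used by the view/model and
 * projection transformation stages. Costs 16 multiplies and 12 adds. */
Vec4 xform_vec4(Vec4 v, const float m[4][4])
{
    Vec4 r;
    r.x = v.x * m[0][0] + v.y * m[1][0] + v.z * m[2][0] + v.w * m[3][0];
    r.y = v.x * m[0][1] + v.y * m[1][1] + v.z * m[2][1] + v.w * m[3][1];
    r.z = v.x * m[0][2] + v.y * m[1][2] + v.z * m[2][2] + v.w * m[3][2];
    r.w = v.x * m[0][3] + v.y * m[1][3] + v.z * m[2][3] + v.w * m[3][3];
    return r;
}

/* Division by w: homogeneous (x, y, z, w) to Cartesian (x/w, y/w, z/w).
 * Assumes w is non-zero (clipping is expected to have run already). */
Vec4 divide_by_w(Vec4 v)
{
    Vec4 r = { v.x / v.w, v.y / v.w, v.z / v.w, 1.0f };
    return r;
}

/* Screen mapping (assumed convention): [-1, 1] maps to [0, width] in x
 * and [0, height] in y. */
void map_to_screen(Vec4 v, float width, float height, float *sx, float *sy)
{
    *sx = (v.x + 1.0f) * 0.5f * width;
    *sy = (v.y + 1.0f) * 0.5f * height;
}
```

Every operation here is a single-precision multiply, add or divide on independent components, which is the kind of work the paired-single instructions discussed later are meant to accelerate.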

The remainder of this section begins this investigation by characterizing a set of 3D applications.

2.2 Benchmark Characterization

To characterize the architectural aspects of 3D applications, we used ATOM's pixie tool [3] to
analyze the Viewperf OpenGL performance evaluation benchmarks [20]. OpenGL is an API for graphics
hardware initially defined by Silicon Graphics [26]. We use Mesa [13], a public-domain software
implementation of the OpenGL specification, in this study. Mesa contains a complete software
implementation of the rendering pipeline, allowing OpenGL applications to execute on machines
without specialized graphics hardware.

The Viewperf suite contains five different graphic model sets including CAID (Computer Aided
Industrial Design) and digital content creation models. Each set has seven to ten tests using
different OpenGL primitives, lighting models and rendering parameters. In this section, we
characterize three different aspects of the Viewperf benchmark set: 1) the dynamic instruction
distribution, 2) the average number of vertices per glBegin/glEnd pair and 3) the amount of
execution time spent in the various geometry pipeline stages.

Dynamic  Instruction  Distribution 

The dynamic instruction distribution of the Viewperf benchmarks (averaged over the five different
benchmarks) indicates that 42.5% of all of the instructions executed by the geometry routines are
single-precision floating-point instructions. A significant number of integer instructions are
needed for executing mode changes (e.g., using a different texture file or changing the lighting
model). The four most frequently executed instructions are load (13.4%), multiply (12.2%), add
(9.7%) and store (6.8%) for single-precision floating-point data. Most of the load instructions come
from loading the transform matrices and vertices. Similarly, the store instructions are used to save
the processed vertices back to memory. The multiply and add instructions are primarily from the
transform and lighting operations.

glBegin(GL_TRIANGLE_STRIP);
glVertex3f(x0, y0, z0);  /* coordinates for vertex v0 */
glVertex3f(x1, y1, z1);  /* coordinates for vertex v1 */
glVertex3f(x2, y2, z2);  /* coordinates for vertex v2 */
glVertex3f(x3, y3, z3);  /* coordinates for vertex v3 */
glVertex3f(x4, y4, z4);  /* coordinates for vertex v4 */
glEnd();

Figure 2: Example of using the GL_TRIANGLE_STRIP primitive.


Average  Number  of  Vertices  per  glBegin/glEnd  Pair 

OpenGL implements ten drawing primitives (e.g., GL_LINES, GL_TRIANGLES and
GL_POLYGON). To draw an object, a set of vertices is bracketed between calls to glBegin() and
glEnd(). The argument passed to glBegin() determines which geometric primitive is constructed from
the vertices. 3D surfaces are usually broken down into triangles. The most efficient way to draw a
series of triangles that are connected to each other is to use the GL_TRIANGLE_STRIP primitive (as
shown in Figure 2). However, some 3D content creation applications do not store objects in a format
amenable to this drawing method. In this case, the OpenGL viewing applications may have to invoke
a drawing primitive for each triangle. Thus, the number of vertices per glBegin/glEnd pair will be
small. Profiling results show that the average number of vertices per glBegin/glEnd pair varies
across the Viewperf benchmarks. Awadvs uses the GL_POLYGON primitive and has only 3.4 vertices on
average, while some of the CDRS tests use the GL_TRIANGLE_STRIP primitive and have up to 400
vertices per glBegin/glEnd pair.
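
To make the difference between the two drawing methods concrete, the C sketch below counts the vertices submitted in each case. The function names are hypothetical; the counts follow from the OpenGL primitive definitions, in which a triangle strip of n vertices produces n - 2 triangles.

```c
/* Vertices submitted inside one glBegin/glEnd pair when n connected
 * triangles are drawn as a single GL_TRIANGLE_STRIP. */
unsigned strip_vertices(unsigned n_triangles)
{
    return n_triangles ? n_triangles + 2 : 0;
}

/* Total vertices submitted when each of the n triangles is drawn with its
 * own glBegin/glEnd pair (3 vertices per pair, n pairs). */
unsigned separate_vertices(unsigned n_triangles)
{
    return 3 * n_triangles;
}
```

For the strip in Figure 2, five vertices describe three triangles in one glBegin/glEnd pair, whereas drawing the same three triangles separately would take nine vertices spread over three pairs.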


There are four ways to exploit parallelism in geometry computation: 1) processing individual
components of a vertex (e.g., coordinate (x, y, z, w) or color (R, G, B, A)), 2) processing multiple
vertices of each primitive within the same pipeline stage, 3) processing vertices of each primitive
in different pipeline stages and 4) processing different primitives. In the Mesa implementation, the
computations for vertices of each primitive (i.e., those vertices bracketed by glBegin/glEnd) in the
same pipeline stage are performed in loops. Several internal library routines are executed before
starting the next stage or a new set of geometry drawings.

A superscalar processor that can only exploit ILP from instructions stored in the dispatch queue is
more likely to exploit the parallelism in the first two scenarios. A small number of vertices
between glBegin/glEnd indicates that fewer independent floating-point instructions can be issued
close in time. Thus, we do not expect benchmarks with a very small average number of vertices per
glBegin/glEnd pair to achieve an IPC as high as benchmarks with a large number of vertices, unless a
very large dispatch queue is used.

Execution  Time  Distribution  of  the  Geometry  Pipeline 

We  divide  the  execution  time  for  the  geometry  pipeline  into  five  portions: 

Light (gl_color_shade_vertices)¹: This portion corresponds to the lighting stage, which calculates
the color for each vertex.

XformV  (gl_xform_normals_4fv):  This  portion  includes  the  vertex  transformation  of  both  the 
viewing/modeling  and  projection  transform  stages.  It  performs  multiplication  of  a  matrix  by  a  vector. 

XformN  (gl_xform_normals_3fv):  This  portion  includes  the  normal  vector  transformation  in  the 
viewing/modeling  transform  stages. 


¹ The corresponding routine name in the Mesa implementation is listed in parentheses.

Figure 3: Execution time distribution in the MESA geometry pipeline.


Div by w/Map (gl_transform_vb_part2): This portion includes the computation of the div by w and
mapping vertex stages. It selects the appropriate lighting routine (e.g., line, polygon, and type of
shading) and calls the fog, texture, and clipping routines before finally projecting the primitives
to screen coordinates.

Other:  This  portion  includes  the  clipping  stage  and  the  library  routines  executed  between  different 
pipeline  stages  and  drawing  primitives. 

As shown in Figure 3, Light and XformV are the two portions where geometry processing spends
the most time. Note that the Light benchmark gets its name because each vertex color is pre-computed
using a global illumination algorithm; it therefore does not actually execute the lighting
functions. Awadvs spends almost 15% of its execution time in the routines executed between different
pipeline stages and drawing primitives, as indicated by Other in Figure 3. This is significantly
higher than the other benchmarks. Awadvs has a very small average number of vertices (3.4) per
glBegin/glEnd pair, which implies that switching between pipeline stages and glBegin/glEnd pairs
occurs more frequently.

Figure 4: Operation of paired-single multiply.


Instruction Format           Latency (cycles)
-------------------------    ----------------
LDPS dest, index(base)       2
STPS src, index(base)        2
PMUL src1, src2, dest        4
PADD src1, src2, dest        4
PSUB src1, src2, dest        4
CVT.S.PL/U src, dest         1
CVT.PS.S src1, src2, dest    1
ADD_HL src, dest             4
LDS_HL dest, index(base)     2

Table 1: Instruction format and latency. All instructions are fully pipelined.


3  SIMD  Instruction  Extensions 

From the benchmark profiling discussed in the previous section, we observe that most of the
arithmetic floating-point instructions are multiply and add, and these operations are all performed
on single-precision values (32-bit). Thus, SIMD instructions that perform multiply or add operations
on two single-precision floating-point values could fully utilize the 64-bit floating-point
registers in current superscalar processors and potentially eliminate a significant number of
instructions. The MIPS V ISA Extension [14] proposes adding a new data type called paired-single,
which packs two single-precision floating-point values into one 64-bit floating-point register. The
multiply and addition operations are performed on the paired-single data in the manner illustrated
in Figure 4.

The SIMD instruction extensions that we consider in this paper are based on the MIPS V ISA
Extensions [15]. The instruction formats and latency assumptions are summarized in Table 1. The LDPS
and STPS instructions load and store a paired-single value (64 bits) from and to memory, ignoring
alignment. The PMUL (PADD/PSUB) instruction performs multiplication (addition/subtraction) of
paired-single values. These paired-single instructions have a 4-cycle latency and are fully
pipelined. CVT.S.PL (CVT.S.PU) is used to extract the lower (higher) part of a paired-single value,
and CVT.PS.S is used to create a paired-single value from two single-precision values.

The ADD_HL and LDS_HL instructions are not present in the MIPS V instruction extensions.
ADD_HL adds the higher and lower parts of a paired-single value together. One example that shows the
usefulness of the ADD_HL instruction is the inner product operation commonly seen in the lighting
stage. The inner product of two vectors (x1, y1, z1) and (x2, y2, z2) is x1*x2 + y1*y2 + z1*z2. The
first two multiplication operations can be paired up, but their results must then be added together.
Without the ADD_HL instruction, we need to use the CVT.S.PU or CVT.S.PL instruction to extract the
higher or lower half to a separate register before performing the addition.
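
The inner product example can be emulated in plain C to show where ADD_HL fits. This is only a sketch of the instruction semantics as described in the text: the PairedSingle struct, the choice of which operand lands in the higher half, and returning the ADD_HL result as a scalar are all assumptions made for illustration, not the MIPS V encoding.

```c
/* Software model of a 64-bit register holding two 32-bit floats. */
typedef struct { float hi, lo; } PairedSingle;

/* CVT.PS.S: build a paired-single value from two single-precision values. */
static PairedSingle cvt_ps_s(float hi, float lo)
{
    PairedSingle r = { hi, lo };
    return r;
}

/* PMUL: two independent 32-bit multiplies, one per half of the datapath. */
static PairedSingle pmul(PairedSingle a, PairedSingle b)
{
    PairedSingle r = { a.hi * b.hi, a.lo * b.lo };
    return r;
}

/* ADD_HL: add the higher and lower halves of one paired-single value. */
static float add_hl(PairedSingle a)
{
    return a.hi + a.lo;
}

/* Inner product x1*x2 + y1*y2 + z1*z2: the x and y products are paired
 * in one PMUL, ADD_HL combines them, and z is handled as a scalar. */
float dot3(const float a[3], const float b[3])
{
    PairedSingle xy = pmul(cvt_ps_s(a[0], a[1]), cvt_ps_s(b[0], b[1]));
    return add_hl(xy) + a[2] * b[2];
}
```

Without add_hl, the model would need an extra extraction step (the CVT.S.PU or CVT.S.PL analogue) before the scalar addition, which mirrors the extra instruction described above.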

The LDS_HL instruction duplicates a single-precision value to form a paired-single value. We use
the computation of transforming normal vectors (multiplication of a 1x3 vector and a 3x3 matrix) to
illustrate the use of this instruction. The pseudo-C code is as follows (u and m represent an array
of vertex coordinates and the transformation matrix, respectively):

for (i = 0; i < number_of_vertices; i++)
{
    q[i][0] = u[i][0] * m[0][0] + u[i][1] * m[1][0] + u[i][2] * m[2][0];
    q[i][1] = u[i][0] * m[0][1] + u[i][1] * m[1][1] + u[i][2] * m[2][1];
    q[i][2] = u[i][0] * m[0][2] + u[i][1] * m[1][2] + u[i][2] * m[2][2];
}

To exploit the parallelism across two vertices, we unroll the loop once and reorder instructions
such that the independent floating-point operations can be easily paired up. The modified version is
listed below:

for (i = 0; i < number_of_vertices; i = i + 2)
{
    q[i][0]   = u[i][0]   * m[0][0] + u[i][1]   * m[1][0] + u[i][2]   * m[2][0];
    q[i+1][0] = u[i+1][0] * m[0][0] + u[i+1][1] * m[1][0] + u[i+1][2] * m[2][0];
    q[i][1]   = u[i][0]   * m[0][1] + u[i][1]   * m[1][1] + u[i][2]   * m[2][1];
    q[i+1][1] = u[i+1][0] * m[0][1] + u[i+1][1] * m[1][1] + u[i+1][2] * m[2][1];
    q[i][2]   = u[i][0]   * m[0][2] + u[i][1]   * m[1][2] + u[i][2]   * m[2][2];
    q[i+1][2] = u[i+1][0] * m[0][2] + u[i+1][1] * m[1][2] + u[i+1][2] * m[2][2];
}


To perform the paired-single multiplication over vertices i and i+1, we need to form paired-single
values for each element of the transformation matrix (i.e., (m[0][0], m[0][0]),
(m[1][0], m[1][0]), ...). The instruction LDS_HL is used for this purpose. Without the LDS_HL
instruction, forming each pair would require one load and one CVT.PS.S instruction.
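
The pairing of the unrolled loop can be sketched in C by modeling each paired-single instruction as a small function. This is an illustrative emulation, not the modified Mesa routine: the PairedSingle struct, the function names, the use of a CVT.PS.S-style helper to pack coordinates from vertices i and i+1, and the scalar fallback for an odd final vertex are all assumptions.

```c
#include <stddef.h>

/* Software model of a 64-bit register holding two 32-bit floats. */
typedef struct { float hi, lo; } PairedSingle;

/* CVT.PS.S: pack two single-precision values into one paired-single. */
static PairedSingle cvt_ps_s(float hi, float lo)
{
    PairedSingle r = { hi, lo };
    return r;
}

/* LDS_HL: load one single-precision value and duplicate it in both halves. */
static PairedSingle lds_hl(const float *p)
{
    PairedSingle r = { *p, *p };
    return r;
}

/* PMUL and PADD: two independent 32-bit operations per instruction. */
static PairedSingle pmul(PairedSingle a, PairedSingle b)
{
    PairedSingle r = { a.hi * b.hi, a.lo * b.lo };
    return r;
}

static PairedSingle padd(PairedSingle a, PairedSingle b)
{
    PairedSingle r = { a.hi + b.hi, a.lo + b.lo };
    return r;
}

/* q[i] = u[i] * m for n normals, processing vertices i and i+1 together. */
void xform_normals_paired(size_t n, const float u[][3],
                          const float m[3][3], float q[][3])
{
    size_t i, c;
    for (i = 0; i + 1 < n; i += 2) {
        for (c = 0; c < 3; c++) {
            /* hi half carries vertex i, lo half carries vertex i+1 */
            PairedSingle acc =
                pmul(cvt_ps_s(u[i][0], u[i + 1][0]), lds_hl(&m[0][c]));
            acc = padd(acc, pmul(cvt_ps_s(u[i][1], u[i + 1][1]), lds_hl(&m[1][c])));
            acc = padd(acc, pmul(cvt_ps_s(u[i][2], u[i + 1][2]), lds_hl(&m[2][c])));
            q[i][c] = acc.hi;
            q[i + 1][c] = acc.lo;
        }
    }
    if (i < n) {
        /* odd vertex count: finish the last vertex with scalar operations */
        for (c = 0; c < 3; c++)
            q[i][c] = u[i][0] * m[0][c] + u[i][1] * m[1][c] + u[i][2] * m[2][c];
    }
}
```

Each output pair takes three PMUL and two PADD operations in this model, compared with six multiplies and four adds in the scalar version, and each matrix element needs a single lds_hl rather than a load followed by a CVT.PS.S.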


4  Experimental  Methodology 

In this section, we describe the simulation environment and processor models considered in this
paper.


4.1 Simulation Framework

Figure 5: Simulation framework. libGEOM.so is a shared library including all the routines
associated with the geometry processing. We only instrument code in the highlighted boxes.


Our simulation environment (shown in Figure 5) uses ATOM [25] to perform execution-driven
simulation. This simulation framework consists of two components. The first component is Mesa, a
software implementation of the OpenGL specification. A shared library that contains all of the
routines associated with geometry computation is separated from the complete Mesa implementation.
ATOM allows us to instrument only this geometry library and the application itself. In this way, we
can simulate the environment where the host CPU is responsible for database traversal and geometry
processing, while specialized hardware is used to process the remaining tasks in the graphics
pipeline. We modified four routines, which account for 75% to 90% of the total execution time for
all the benchmarks we ran, to incorporate paired-single instructions. These routines correspond to
the Light, XformV, XformN and Div by w/Map portions described in Section 2.2.

The  second  component  of  our  simulation  framework  is  an  ATOM-based  simulator  that  models  an 
out-of-order  superscalar  processor  with  speculative  execution  [4],  whose  instruction  set  is  based  on  the 
DEC  Alpha  processor  [24]. 

                        |   # of Integer Functional Units    |  # of Floating-Point Functional Units
Processor  Total Issue  | Issue   loads &   control          | Issue
Model      Width        | Limit   stores    flow      other  | Limit   mul   div   sqrt   other
---------  -----------  | -----   -------   -------   -----  | -----   ---   ---   ----   -----
Base       4            | 4       2         2         4      | 2       1     1     1      1
2xBase     8            | 8       4         4         8      | 4       2     1     1      2
4xBase     16           | 16      8         8         16     | 8       4     1     1      4
2xFP       6            | 4       2         2         4      | 4       2     1     1      2
4xFP       10           | 4       2         2         4      | 8       4     1     1      4

Table 2: Instruction issue rules (ready instructions are issued in fetch order).


To simulate the new instructions, we place innocuous (but unique) "marker" instructions where we
want to replace the original code with new instructions. The operands of the marker instructions
indicate different instruction types (e.g., LDPS, MULPS, etc.). The appropriate operands of the new
instructions are passed through the next 2 or 3 marker instructions, depending on the number of
operands required. In this way, instruction dependencies are accurately maintained. The simulator
decodes each instruction and takes appropriate actions to simulate the paired-single execution when
it encounters the marker instruction.


4.2  Processor  Models 

The baseline processor model studied in this paper is a 4-way, out-of-order issue superscalar
processor. The issue rules and the functional unit latencies are summarized in Table 2 and Table 3,
respectively. The maximum number of instructions that can be inserted into the dispatch queue or
committed in one cycle is equal to the issue width. When an instruction is inserted into the dispatch
queue, its destination register is mapped to a physical register and a reorder buffer entry is allocated.
Once an instruction is issued, it is removed from the dispatch queue, but the register mapping remains
active until the instruction commits in program order from the reorder buffer [5]. Note that the reorder
buffer size is determined by the number of physical registers. We implement a precise exception
model, so an instruction can commit only when all the instructions preceding it in program order have
completed. We assume a perfect memory system (i.e., every memory reference and instruction fetch
hits in the L1 cache; see footnote 2), a unified dispatch queue, and separate register files for the integer
and floating-point functional units. Speculative execution is enabled by implementing the branch
prediction scheme proposed by McFarling [11].

Instruction type                   Latency (cycles)   Pipelined
Integer          multiplication     6                 yes
                 load               2                 yes
                 store              1                 yes
                 control flow       1                 yes
                 other              1                 yes
Floating-point   32-bit div         8                 no
                 64-bit div        16                 no
                 square root       33                 no
                 other              4                 yes

Table 3: Instruction latencies.
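The in-order commit rule of the processor model (an instruction commits only after every older
instruction has completed, and at most issue-width instructions commit per cycle) can be sketched as
below. The reorder buffer is modeled as a plain queue of dictionaries; the field names and the
function commit_cycle are hypothetical and are not taken from the simulator.

```python
from collections import deque

def commit_cycle(rob, commit_width):
    """Commit completed instructions from the head of the reorder buffer.

    Stops at the first instruction that has not completed, so younger
    completed instructions wait behind it (precise exception model).
    """
    committed = []
    while rob and rob[0]["done"] and len(committed) < commit_width:
        committed.append(rob.popleft()["id"])
    return committed

# Reorder buffer in program order: instruction 1 is still executing.
rob = deque([
    {"id": 0, "done": True},
    {"id": 1, "done": False},
    {"id": 2, "done": True},
])

first = commit_cycle(rob, commit_width=4)   # only instruction 0 can commit
rob[0]["done"] = True                       # instruction 1 completes
second = commit_cycle(rob, commit_width=4)  # instructions 1 and 2 commit together
print(first, second)
```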

2  The miss rates for a 64-KB, 2-way set-associative D-cache and an 8-KB direct-mapped I-cache are
both less than 2% for most of the benchmarks. Thus, we assume a perfect memory system to reduce
simulation time.

To investigate the effect of a wider-issue superscalar processor on the performance of geometry
processing, we examine four additional models: 2xBase, 4xBase, 2xFP and 4xFP, as listed in Table 2.
The 2xBase and 4xBase models are 8-way and 16-way issue processors, respectively. Their issue rules
are similar to those of the Base model, but for most instruction types, two or four times as many
instructions can be issued in one cycle. The exceptions are division and square root, whose limits
remain the same as in the baseline model. We do not scale these two functional units so that the
performance comparison between 2xFP and a baseline processor with paired-single instructions is fair,
since paired-single operations are not implemented for division and square root. For the 2xFP (4xFP)
configuration, we double (quadruple) only the floating-point functional units and the floating-point
issue width. The number of integer functional units remains the same as in the baseline processor. The
total issue width then becomes 6 and 10 for 2xFP and 4xFP, respectively.
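The scaling rules just described can be written down and checked against the rows of Table 2. The
sketch below is an illustration only: the Config fields and the helpers scale_all and scale_fp are
names chosen for this example, and the rule used for the total issue width of the FP-only models
(Base width plus the added floating-point issue slots) is inferred from the table values.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Config:
    total: int      # total issue width
    int_limit: int  # integer issue limit
    mem: int        # loads and stores
    ctrl: int       # control flow
    int_other: int
    fp_limit: int   # floating-point issue limit
    fp_mul: int
    fp_div: int
    fp_sqrt: int
    fp_other: int

BASE = Config(4, 4, 2, 2, 4, 2, 1, 1, 1, 1)

def scale_all(base, k):
    """kxBase: scale every limit by k except division and square root."""
    return Config(k * base.total, k * base.int_limit, k * base.mem, k * base.ctrl,
                  k * base.int_other, k * base.fp_limit, k * base.fp_mul,
                  base.fp_div, base.fp_sqrt, k * base.fp_other)

def scale_fp(base, k):
    """kxFP: scale only the floating-point side; the integer side is unchanged."""
    return replace(base,
                   total=base.total + (k - 1) * base.fp_limit,
                   fp_limit=k * base.fp_limit,
                   fp_mul=k * base.fp_mul,
                   fp_other=k * base.fp_other)

for name, cfg in [("2xBase", scale_all(BASE, 2)), ("4xBase", scale_all(BASE, 4)),
                  ("2xFP", scale_fp(BASE, 2)), ("4xFP", scale_fp(BASE, 4))]:
    print(name, cfg)
```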

5  Simulation  Results 

Because of lengthy simulation times, we use only CDRS, Awadvs and DX from Viewperf as our
benchmarks. Each of these benchmarks is composed of several tests. For space reasons, we present only
test1 from each benchmark. These three tests are chosen because they are representative of all the tests
(the complete simulation results can be found in [27]). CDRS test1 is a wireframe rendering
application, and both DX and Awadvs have at least one light source. Awadvs has only 3.4 vertices on
average per glBegin/glEnd, while CDRS and DX test1 have 30 and 96 vertices, respectively.

We present our simulation results in three parts. First, we investigate how well conventional
superscalar processors exploit the parallelism in geometry processing. Then we present the
performance of paired-single execution on different processor models. Finally, we compare the relative
performance of different processors with and without paired-single instructions, accounting for
potential increases in CPU clock cycle time.

5.1  Scaling  a  Conventional  Design 

The dispatch queue and register file sizes have a significant impact on how much ILP can be
exploited in a superscalar processor. A wider-issue machine usually requires a larger dispatch queue
and register file. To evaluate the potential performance improvement achieved by increasing the issue
width, the superscalar simulator is first configured with 2048 floating-point and 2048 integer
registers. With such a large register file, the CPU never stalls due to a lack of free registers. We then
vary the dispatch queue size from 64 to 256. The commit IPC (see footnote 3) for the various processor
models is shown in Figure 6.

Figure 6: The commit IPC of various processor models with varying dispatch queue size.
(4xBase* represents the issue IPC of the 4xBase processor)

CDRS test1 has the highest IPC for all the configurations among all the benchmarks we ran. With
the largest dispatch queue (256 entries), the commit IPC of 2xBase (8-way issue) is 6.5, almost twice
that of Base (3.4). Doubling only the floating-point issue width (2xFP) achieves a 36% performance
improvement. However, quadrupling only the floating-point issue width (4xFP) does not perform any
better than 2xFP because the loads that read the source operands for the floating-point operations
become the bottleneck. The commit IPC of the 4xBase processor (16-way issue) is 9.7, about 2.7 times
that of Base.

The continuous growth of commit IPC as the issue width increases indicates that substantial
parallelism exists in geometry processing for this benchmark. Note that the commit IPC grows with
larger dispatch queue size, but the degree of improvement diminishes after a certain size. This point
occurs around a dispatch queue size of 32 for the Base model, 64 for both 2xBase and 2xFP, and 128
for 4xBase.

For DX test1, the commit IPC of the processor models smaller than 4xBase is comparable to that of
CDRS test1, except for 2xFP, which achieves only a 13% performance improvement. The ratio of
floating-point arithmetic operations to load instructions is 1:1 for DX test1 and 2:1 for CDRS test1.
Thus, increasing the floating-point issue width alone does not improve the performance of DX as much
as that of CDRS. For DX test1, the commit IPC of 4xBase is 8.48, lower than that of CDRS test1 (9.7).
The lower commit IPC is due to more mispredicted branches. The issue IPC for 4xBase is plotted in
Figure 6 to illustrate this scenario. Issued instructions cannot commit if a preceding branch in program
order was mispredicted. DX test1 has a larger difference between the issue and commit IPC than
CDRS test1. This is because the lighting computation has more conditional branches than the
transform; thus the performance of a lighting-intensive application like DX test1 is more sensitive to
branch prediction accuracy than that of a wireframe rendering application like CDRS test1.
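The relationship between issue IPC and commit IPC used in this comparison can be made concrete
with a short sketch. The instruction and cycle counts below are made-up illustrative values, not
simulation results from this paper, and the function names are chosen for the example.

```python
def ipc(instructions, cycles):
    """Instructions per cycle for a given instruction count (issued or committed)."""
    return instructions / cycles

def squashed_fraction(issued, committed):
    """Fraction of issued instructions that never commit, for example
    because they were fetched down a mispredicted branch path."""
    return (issued - committed) / issued

# Hypothetical run: 1000 instructions issued, 848 committed, 100 cycles.
issued, committed, cycles = 1000, 848, 100
print("issue IPC: ", ipc(issued, cycles))
print("commit IPC:", ipc(committed, cycles))
print("squashed:  ", squashed_fraction(issued, committed))
```

A larger gap between the two IPC values means that more issue bandwidth is spent on instructions that
are later discarded.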

Awadvs test1 has the lowest IPC, primarily because of its small number of vertices (3.4) per
glBegin/glEnd. Observe that the commit IPC increases linearly with the dispatch queue size; hence, for
this benchmark, the dispatch queue is still the bottleneck even when it has 256 entries. Because the
4xFP processor performs the same as 2xFP for all the benchmarks we ran, we no longer consider this
configuration in the following analysis.

3  The commit IPC is the ratio of the number of instructions that commit to the total execution cycles.

Figure 7: The commit IPC for various processor models with varying register file size.


To evaluate how the register file size affects performance, we keep the dispatch queue size constant
(64 entries for Base, 2xFP and 2xBase and 128 entries for 4xBase) while varying the register file size
from 64 to 256. The results are shown in Figure 7. The 2048-entry register file is shown as a reference
point. Using more than 128 registers for Base, 2xFP and 2xBase, or more than 256 registers for
4xBase, does not improve performance significantly.

In the next section, we analyze the benefit of paired-single execution on the Base, 2xFP and
2xBase processors. 4xBase is a 16-way issue machine and requires a 128-entry dispatch queue and
256 registers. This configuration is too large to implement practically by simply scaling the Base
configuration; hence we do not consider it further.

                        CDRS                                       AW
Inst          Inst count             Reduction       Inst count               Reduction
type          Orig        Pair       %               Orig         Pair        %
add            7645898     3965927   48               18884890     11315926   40
sub               5141        5141    0                2195114      2073434    6
mul           10760847     5666089   47               29192094     18531619   37
ld             8052969     7625453    5               43112841     36494561   15
sts            3936594     2521807   36               16402547     12928214   21
cvt.s.pl/u           0           0    -                      0        65905    -
cvt.ps.s             0           0    -                      0      1111734    -
other         31413092    31413092    0              139816628    139816628    0
total         61809400    51192400   17              247409000    220068000   11

                        DX
Inst          Inst count             Reduction
type          Orig        Pair       %
add             7577899     4293152  43
sub                4879        4879   0
mul            11572039     7058061  39
ld             21965717    19932776   9
sts            10809407     9211454  15
cvt.s.pl/u            0       93461   -
cvt.ps.s              0       80171   -
other         109218938   109218938   0
total         161144000   149823000   7

Table 4: Instruction distribution for non-paired and paired-single execution.

5.2  The  Performance  Improvement  of  Paired-Single  Execution 

Adding paired-single instructions not only reduces the number of single-precision floating-point add,
subtract, and multiply instructions, it can also eliminate load/store instructions if the LDPS
instruction can be used to load two single-precision floating-point values together. Table 4 shows the
instruction distribution for both non-paired and paired-single execution. The number of multiply and
add instructions is reduced by approximately 50% for CDRS test1 and 40% for DX and Awadvs test1.
The numbers of load and store instructions are reduced by 5% to 15% and 15% to 36%, respectively.
Recall that paired execution requires extra instructions (CVT.S.PL/U and CVT.PS.S) to create a
paired-single value or to extract the lower (upper) part of a paired-single value. However, for the
benchmarks we tested, these instructions are negligible. They account for less than 1% of all
instructions for DX and Awadvs test1, and CDRS test1 does not use these instructions, since all
paired-single values are created using the LDPS instruction. After taking into account these extra
instructions, the overall instruction reduction is 17% for CDRS, 11% for Awadvs and 7% for DX.

Figure 8: Performance improvement of paired-execution (x-axis: processor models).
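The reduction percentages quoted from Table 4 follow directly from the instruction counts. The short
check below recomputes them for the add, mul and total rows; the dictionary layout and the function
name reduction_pct are choices made for this example.

```python
# (original count, paired-single count) taken from Table 4
TABLE4 = {
    "CDRS":   {"add": (7645898, 3965927), "mul": (10760847, 5666089),
               "total": (61809400, 51192400)},
    "Awadvs": {"add": (18884890, 11315926), "mul": (29192094, 18531619),
               "total": (247409000, 220068000)},
    "DX":     {"add": (7577899, 4293152), "mul": (11572039, 7058061),
               "total": (161144000, 149823000)},
}

def reduction_pct(orig, pair):
    """Percentage of instructions eliminated by paired-single execution."""
    return 100.0 * (orig - pair) / orig

for bench, rows in TABLE4.items():
    summary = {kind: round(reduction_pct(*counts)) for kind, counts in rows.items()}
    print(bench, summary)
```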

Reducing the number of instructions has two potential advantages. First, combining two
floating-point operations effectively enables the CPU to look further ahead to find independent
instructions to issue. In other words, adding paired-single instructions could achieve the same effect as
increasing the dispatch queue size. Second, it can improve the instruction cache performance. We did
not analyze this effect because of the low instruction cache miss rate for the benchmarks we ran. We
can expect a higher performance impact on an embedded system, which is usually configured with a
smaller instruction cache.

We evaluate the performance improvement of paired-single execution on the Base, 2xFP and
2xBase models. The simulation results are shown in Figure 8. The y-axis shows the speedup of
paired-single over non-paired execution. CDRS test1 has the best performance improvement: 28% on
the Base model, 13% on 2xFP and 20% on 2xBase. DX test1 has the smallest performance
improvement since paired-single execution reduces its instruction count by only 7%. Note that our
speedups may not be optimal. First, there are some routines required for geometry processing that we
have not converted to use paired-single instructions. However, the impact of these procedures on
performance should not be substantial. Second, we have not optimized the instruction schedule of the
paired-single sequence. Different computation sequences incur different register allocation and
instruction scheduling, and analyzing these effects requires further research.

Figure 9: Relative speedup of various processor models over the Base with a dispatch queue of 64
entries and 128 registers.

5.3  Paired-Single  vs.  Wider  Issue 

In  this  section,  we  discuss  the  relative  performance  of  various  processor  models  with  and  without 
the  paired-single  instruction  set.  First,  we  compare  relative  performance  assuming  that  CPU  cycle  time 
remains  the  same  for  all  processor  models.  Then,  we  investigate  how  changes  in  cycle  time  affect 
overall  performance.  All  the  processor  models  are  configured  with  a  64-entry  dispatch  queue  and  128 
registers.  These  numbers  are  chosen  such  that  the  performance  of  an  8-way  issue  (2xBase)  processor 
is  not  constrained  too  much  by  the  dispatch  queue  and  register  file  and  the  processor  configuration  is 
within  a  reasonable  range. 

The simulation results, assuming no changes in CPU cycle time, are shown in Figure 9. The y-axis
is the speedup of the various processor models over the Base configuration. Adding paired-single
instructions effectively doubles the floating-point issue width, so a Base processor with the
paired-single instruction extension can potentially achieve the same floating-point processing
capability as 2xFP.

Our results show that Base+Pair performs within 7% of 2xFP for CDRS and DX, and it even
outperforms 2xFP for Awadvs test1. Besides doubling the floating-point processing rate, adding
paired-single instructions makes better use of the dispatch queue, as mentioned in the previous
section. For an application where the dispatch queue is the performance bottleneck, like Awadvs test1,
Base+Pair has a performance advantage over 2xFP. An 8-way issue processor using paired-single
instructions (2xBase+Pair) can achieve a speedup of 1.9 over Base for CDRS test1.

5.3.1  Effects  on  Clock  Cycle  Time 

Previous studies have shown that increasing issue width has a significant impact on the processor
cycle time [4] [21]. Palacharla et al. [21] study how the instruction dispatch, issue logic and data
bypass delays vary with issue width. Their results show that the issue logic determines the critical path
delay in a 0.35um technology for both 4-way and 8-way issue processors (not considering cache and
register files) and that the wakeup logic delay (part of the issue logic) grows linearly with the issue
width. Farkas et al. [4] show that the issue width determines the number of read/write ports to a
register file, and thus can have a significant impact on the cycle time.

Figure 10: Relative performance of various processor models assuming that the window issue logic
determines processor cycle time.

Table 5: Cycle time increase over the Base model assuming 0.35um technology.


The percentage increases in issue logic delay and register file access time over the Base model are
summarized in Table 5. We use a modified version of CACTI [8], developed by K. Farkas in [4], to
generate the register file access time. Note that the floating-point register file of 2xFP has the same
number of read/write ports as the integer register file of Base. Thus, the register file access time of
2xFP is equal to the Base access time. The 2xBase model increases the register file cycle time by 50%
in a 0.35um technology. We use the data provided in [22] to derive the issue logic delay. Linear
extrapolation is used to obtain the data for configurations not studied in that paper. The 2xFP model
increases the issue logic delay by 7%, and 2xBase by 14%.
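The issue logic numbers quoted above are consistent with a straight line through the 4-way and 8-way
design points. The sketch below assumes that these two points (0% at 4-way, 14% at 8-way) are the
known data and that the 6-way 2xFP value lies on the line between them; the exact procedure used to
derive the delays may differ, and the function name is hypothetical.

```python
def issue_logic_delay_increase(issue_width, w0=4, d0=0.0, w1=8, d1=14.0):
    """Percentage increase in issue logic delay over the Base model,
    assuming delay grows linearly with issue width between (w0, d0) and (w1, d1)."""
    slope = (d1 - d0) / (w1 - w0)  # percent per additional issue slot
    return d0 + slope * (issue_width - w0)

print(issue_logic_delay_increase(6))  # 2xFP, total issue width 6
print(issue_logic_delay_increase(8))  # 2xBase, total issue width 8
```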

We present simulation results in two sets. The first set assumes that the issue logic (wakeup +
selection) determines the critical path delay (Figure 10), and the second set assumes that the register
file access does (Figure 11). The y-axis is the speedup of the various processor models over the Base
configuration. If the issue logic determines the critical path delay, Base+Pair outperforms 2xFP for all
the benchmarks. The performance difference is most substantial for Awadvs test1. The 2xBase+Pair
processor model achieves a speedup of 1.6 for CDRS test1. However, if the CPU cycle time is
determined by the register file access, increasing the issue width to 8-way (2xBase) has a negative
impact on performance for Awadvs and DX test1, as shown in Figure 11. Thus, 2xFP+Pair becomes
the best design choice, achieving a speedup of 1.5 over the Base processor model for CDRS and 1.2
for both Awadvs and DX.

Figure 11: Relative performance of various processor models assuming that the register file access
delay determines processor cycle time.
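A rough way to see how cycle time changes these rankings is to divide the equal-cycle-time speedup
by the relative cycle time. This is a simplification that assumes the cycle count of a run is unaffected
by the clock period; the function name is chosen for this example, and the inputs are the speedups
and cycle time increases quoted in the text.

```python
def adjusted_speedup(equal_clock_speedup, cycle_time_increase):
    """Speedup over Base once the longer cycle time is accounted for.

    equal_clock_speedup: speedup assuming the same cycle time as Base
    cycle_time_increase: fractional cycle time increase (0.14 means 14%)
    """
    return equal_clock_speedup / (1.0 + cycle_time_increase)

# 2xBase+Pair on CDRS test1: 1.9 at equal cycle time.
print(adjusted_speedup(1.9, 0.14))  # issue logic on the critical path
print(adjusted_speedup(1.9, 0.50))  # register file access on the critical path

# A benchmark that gains only 1.2 from 8-way issue falls below 1.0
# when the register file stretches the cycle time by 50%.
print(adjusted_speedup(1.2, 0.50))
```

Under this approximation the first case lands close to the 1.6 reported for 2xBase+Pair, and the last
case illustrates why 8-way issue can hurt performance when the register file sets the cycle time.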

6  Performance Comparison: General Processors vs. High-End Systems

To get an idea of how well a general processor can execute the geometry pipeline, we compare the
performance of a general processor with that of high-end graphics systems in terms of frames per
second (the high-end performance numbers are obtained from [19]). Note that the general processor
performance shown here is optimistic because we do not take into account some system effects, such
as the TLB, the instruction cache and CPU cycle time effects, nor do we scale the technology for the
custom systems. Nonetheless, this comparison provides a general idea of the relative performance.
The comparison is shown in Figure 12 for test1 from CDRS, Awadvs, and DX. We look at three
general processor models, Base, Base+Pair and 2xBase+Pair, assuming the same CPU clock cycle
time. The high-end systems that we compare against are the HP Visualize fx6 for CDRS and the SGI
Infinite Reality for Awadvs and DX. These two systems achieve the highest benchmark results.

Figure 12: Performance comparison: general processor vs. high-end graphics systems.


Assuming a 500 MHz CPU clock rate, adding paired-single instructions to a 4-way issue processor
(Base+Pair) achieves performance close to the high-end system for CDRS test1 (250 vs. 290 frames
per second). On an 8-way issue processor with paired-single instructions (2xBase+Pair), a general
processor can even outperform the high-end system. However, a general processor performs less
effectively for lighting-intensive applications like DX and Awadvs. For these two benchmarks, a
2xBase+Pair processor reaches 60% to 70% of the performance of the high-end system if the CPU
clock rate is 500 MHz. To predict future general processor performance as process technology
progresses, we look at different CPU clock rates. For Awadvs test1, a 700 MHz 2xBase+Pair
processor can achieve performance similar to the current high-end system. For DX test1, a
2xBase+Pair processor still performs slightly below the current high-end system even when the CPU
clock rate reaches 800 MHz.
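Because the simulations assume a perfect memory system, the frame rate in this comparison can be
treated as proportional to the CPU clock rate. Under that assumption, the clock rate needed to match
a given high-end frame rate can be estimated as below; the function names are hypothetical and the
only inputs taken from the text are the CDRS test1 figures (250 frames per second for Base+Pair at
500 MHz, 290 for the high-end system).

```python
def fps_at_clock(fps_ref, clock_ref_mhz, clock_mhz):
    """Frame rate at a new clock rate, assuming frame rate scales linearly with clock."""
    return fps_ref * clock_mhz / clock_ref_mhz

def clock_to_match(target_fps, fps_ref, clock_ref_mhz):
    """Clock rate (MHz) needed to reach target_fps under the same linear assumption."""
    return clock_ref_mhz * target_fps / fps_ref

# Base+Pair on CDRS test1: 250 frames/s at 500 MHz versus 290 frames/s high-end.
print(clock_to_match(290, 250, 500))
print(fps_at_clock(250, 500, 700))
```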

7  Conclusion 

The widespread use of multimedia applications presents new design challenges for system
designers. In this paper, we examine the performance of geometry computation in three-dimensional
graphics applications on future superscalar processors. Geometry computation is single-precision
(32-bit) floating-point intensive. We investigate the performance of recently proposed instructions that
perform two independent 32-bit operations by packing the operands in 64-bit registers and exploiting
the existing 64-bit datapath. We use simulation to compare the performance of these new instructions,
called paired-single, to that achieved by increasing a conventional out-of-order processor's issue
width.

From our simulation results, we found that paired-single instructions improve performance by up to
28% on a 4-issue processor and 20% on an 8-issue processor. These improvements are comparable to
those achieved by doubling only the floating-point issue width (2xFP). Our results reveal that 4xFP
performs the same as 2xFP because load instructions that read source operands of floating-point
operations become the bottleneck; further gains therefore require a commensurate increase in the
integer issue width.

We also found that the average number of vertices processed in each stage of the geometry pipeline
(i.e., vertices per glBegin/glEnd) is the primary factor determining performance on superscalar
processors. For benchmarks that have a large number of vertices per glBegin/glEnd, the speedup of an
8-way issue processor over a 4-way is 1.6 with a 64-entry dispatch queue and 128 registers. However,
for benchmarks that have a small average number of vertices per glBegin/glEnd, the speedup is only
1.2.

Considering the impact of the issue width on the CPU cycle time, we looked at two pipeline stages
that can be on the critical timing path. One is the register file access and the other is the issue logic. If
the issue logic is on the critical path, an 8-way issue processor with paired-single instructions provides
a 20% to 65% performance improvement over a 4-way issue processor. However, if the register file
access is on the critical path, the processor cycle time increases by almost 50% going from 4-way to
8-way. Thus, doubling only the floating-point issue width of a 4-issue processor (2xFP) with
paired-single instructions becomes the best design choice. The improvement over a 4-way issue
processor ranges from 20% to 50%.